RESEARCH SUMMARY

Introduced here is the concept of a proxy set, which we
define to be a collection of potential explanatory variables
linearly related to one another. Each member of the proxy
set therefore conveys some, and perhaps much, of the same
information as other members of the same proxy set if they
are included in a linear regression model together.
Consequently, interpreting a coefficient in a multiple
regression equation can be misleading if proxy-set
membership is ignored. All potential explanatory variables
should be examined before a linear regression model is
constructed to see if some variables belong to proxy sets.
Accounting for those proxies not included in the model as
well as those that are included permits a more realistic
interpretation of the coefficients in the final regression
model. Seven diagnostic techniques are discussed: the
correlation matrix method, the iterative variance inflation
factor method (introduced here for the first time), the
variance decomposition method, principal components without
rotation, principal components with rotation, factor
analysis, and cluster analysis. The effectiveness of these
seven methods in identifying proxy sets is examined using
data with known proxy set structure. The iterative variance
inflation factor and the variance decomposition methods were
the best overall performers; factor analysis was the worst.

INTRODUCTION

Multiple  linear  regression  is  used  extensively  when 
analyzing  data  from  natural  resource  studies.  Many 
natural  resource  workers  do  not  limit  use  of  the  re- 
gression equations  to  prediction.  Rather,  they  inter- 
pret the  estimated  regression  coefficients,  and  these 
interpretations  become  the  basis  for  far-reaching  policy 
recommendations  and  management  decisions. 

Regression  analyses  are  usually  expressed  in  terms 
of  a  dependent  variable,  which  we  call  a  response  vari- 
able. Likewise,  we  use  the  term  "explanatory  variable" 
to  indicate  what  is  often  called  an  independent  vari- 
able. Our  reason  for  not  using  the  terms  "dependent" 
and  "independent"  is  to  avoid  confusion  when  we  dis- 
cuss mathematical  independence  (orthogonality),  a 
condition  that  plays  a  critical  role  in  this  article. 

Regression  coefficients  are  commonly  interpreted 
as  representing  the  change  in  the  response  variable 
caused  by  a  one-unit  increase  in  the  corresponding 
explanatory  variable,  with  all  other  explanatory  vari- 
ables held  constant.  This  is  tantamount  to  taking  a 
partial  derivative  of  an  equation  with  respect  to  a  spe- 
cific explanatory  variable  and  interpreting  it.  This 
procedure  has  several  serious  problems:  (1)  a  cause- 
and-effect  relationship  is  not  inherent  in  regression 
analysis;  (2)  some  explanatory  variables  (such  as 
weather)  cannot  be  held  constant;  (3)  interpretation 
of  regression  coefficients  must  take  into  account  other 
explanatory  variables  in  the  model  (the  traditional 
collinearity  problem);  and  (4)  interpretation  of  the  re- 
gression coefficients  must  also  take  into  account  other 
explanatory  variables  that  are  not  in  the  model.  Com- 
mon mistakes  involving  items  (1)  and  (2)  are  discussed 
at  length  in  the  literature  (Hocking  1976;  Mosteller 
and Tukey 1977; Draper and Smith 1981). Belsley
and  others  (1980)  discuss  the  traditional  collinearity 
problem,  item  (3).  In  this  article,  we  concentrate  on 
the  importance  of  explanatory  variables  that  are  not 
in  the  model,  item  (4),  although  item  (3)  is  inherent 
in  our  discussions. 


THE  PROBLEM 

The  explanatory  variables  finally  included  in  a  re- 
gression model  are  often  selected  from  a  larger  set  of 
potential  explanatory  variables.  Some  are  truly  inde- 
pendent of  (orthogonal  to)  each  other  and  pose  no  diffi- 
culty either  in  model  building  or  interpretation.  Other 
explanatory  variables  go  together,  occurring  as  pack- 
ages or  associations.  We  call  these  packages  proxy 
sets.  We  define  a  proxy  set  as  a  collection  of  explana- 
tory variables,  any  one  of  which  conveys  some  of  the 
same  information  as  any  of  the  other  variables  in  the 
set.  In  a  sense,  each  variable  in  a  proxy  set  represents 
all  other  variables  in  the  same  set,  at  least  partially; 
each  such  variable  could  conceivably  serve  as  a  proxy 
for  the  entire  set.  For  example,  average  leaf  length  and 
average  leaf  width  may  belong  to  a  proxy  set  convey- 
ing the  effect  of  leaf  size  on  a  response  variable  such 
as  total  biomass.  If  an  explanatory  variable  is  not  a 
member  of  any  proxy  set,  we  call  it  a  nonproxy  vari- 
able. Proxy-set  membership  has  far-reaching  impli- 
cations for  the  interpretation  of  regression  coefficients. 

An  example  illustrates  the  problem  of  proxy-set 
membership.  Consider  the  effects  of  four  explanatory 
variables  on  the  ratio  of  bract  width  to  scale  width 
(RATIO)  in  larch  cones:  wet  cone  length  (WETLEN), 
wet  cone  width  (WETWID),  dry  cone  length  (DRYLEN), 
and  dry  cone  width  (DRYWID).  At  this  point,  we  wish 
to  draw  attention  to  only  two  of  these  four  potential 
explanatory  variables:  WETLEN  and  DRYLEN.  The 
model  considered  is: 

$$\text{RATIO} = \frac{\text{Bract width}}{\text{Scale width}} = \beta_0 + \beta_1\,\text{WETLEN} + \beta_2\,\text{DRYLEN} + \beta_3\,\text{WETWID} + \beta_4\,\text{DRYWID} + \varepsilon$$

The estimate of $\beta_1$, the coefficient of WETLEN, is
0.000705 (t = 0.47 with P = 0.636). The estimate of
$\beta_2$, the coefficient of DRYLEN, is -0.001780 (t = -1.13
with P = 0.261). The small t-values and corresponding
large probabilities for the coefficient estimates of
both WETLEN and DRYLEN would seem to suggest
that neither of these variables has a potential for
explaining the bract/scale ratio. In fact, faced with
comparable evidence, many published linear regression
analyses have drawn similar conclusions.

In  reality,  we  would  be  in  error  if  we  accepted  the 
interpretation  that  neither  WETLEN  nor  DRYLEN 
is  important.  This  is  seen  easily  by  fitting  the  same 
model as before, but with DRYLEN removed from the
model.  The  estimate  of  the  coefficient  of  WETLEN 
becomes  -0.000898  (t  =  -2.06  and  P  =  0.0404),  a  sig- 
nificant result,  in  direct  contrast  to  the  nonsignificant 
result  obtained  when  both  WETLEN  and  DRYLEN 
were  in  the  model.  Likewise,  fitting  the  model  with 
DRYLEN  present,  but  with  WETLEN  removed, 
produces  an  estimated  coefficient  for  DRYLEN  of 
-0.001064  (t  =  -2.30  and  P  =  0.0220).  Again,  this 
result  is  in  direct  contrast  with  the  result  obtained 
when  both  WETLEN  and  DRYLEN  were  included 
in  the  model. 
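
The masking effect described above can be reproduced with synthetic data. The following is a minimal sketch in Python, assuming NumPy is available. The simulated variables (`wetlen`, `drylen`, `ratio`), the sample size, the noise levels, and the helper `ols_t` are illustrative assumptions. They are not the larch cone data behind the estimates quoted in the text.

```python
import numpy as np

def ols_t(y, X):
    """Ordinary least squares with an intercept; returns slopes and their t-values."""
    X1 = np.column_stack([np.ones(len(y)), X])      # add intercept column
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)   # least-squares estimates
    resid = y - X1 @ beta
    dof = len(y) - X1.shape[1]                      # residual degrees of freedom
    sigma2 = resid @ resid / dof                    # residual variance
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X1.T @ X1)))
    return beta[1:], beta[1:] / se[1:]

# Hypothetical data: two noisy measurements of one underlying "length".
rng = np.random.default_rng(0)
n = 400
length = rng.normal(size=n)
wetlen = length + 0.1 * rng.normal(size=n)
drylen = length + 0.1 * rng.normal(size=n)
ratio = -0.5 * length + rng.normal(size=n)

_, t_both = ols_t(ratio, np.column_stack([wetlen, drylen]))
_, t_wet = ols_t(ratio, wetlen)
_, t_dry = ols_t(ratio, drylen)
print("t-values, both in model:", t_both)
print("t-value, WETLEN alone:  ", t_wet)
print("t-value, DRYLEN alone:  ", t_dry)
```

In this sketch each variable alone should show a much larger absolute t-value than it does when both are fitted together, which is the same pattern the larch cone example illustrates.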

Either  WETLEN  or  DRYLEN  can  describe  the  bract/ 
scale ratio if the other is absent from the model. It
would  have  been  a  mistake  to  have  concluded  that 
neither  variable  is  useful  in  describing  the  bract/scale 
ratio.  WETLEN  and  DRYLEN  are  members  of  the 
same  proxy  set.  Either  could  be  used  to  represent  the 
effect  of  length  on  the  bract/scale  ratio.  In  such  cases, 
a  given  regression  coefficient  might  be  thought  of  as 
representing  the  effect  of  the  entire  proxy  set  on  the 
response  variable.  This  example  illustrates  three 
characteristics  of  proxy  sets: 

1.  If  more  than  one  member  of  a  proxy  set  is  included 
in  a  regression  model,  all  members  of  that  set  included 
in  the  model  appear  less  important  and  may  even  be 
statistically  "nonsignificant." 

2.  To  some  extent,  each  member  of  a  proxy  set  is 
capable  of  "standing  in"  for  all  members  of  the  set. 
Thus,  if  a  proxy  set  is  important,  at  least  one  (and 
possibly  only  one)  member  of  the  set  is  needed  in  the 
regression  model. 

3. If a member of a proxy set is included in a model,
its statistical "significance" implies that the entire
proxy set is "significant." This includes all variables
in the set, whether they are included in the model
or not.

Traditional  regression  diagnostic  tools  focus  exclu- 
sively on  the  explanatory  variables  included  in  the 
proposed  model.  Here,  we  focus  on  the  problems  of 
interpreting  regression  coefficients  corresponding  to 
explanatory  variables  that  are  proxies  of  other  explana- 
tory variables,  some  of  which  may  not  have  been  in- 
cluded in  the  model.  This  leads  to  two  important 
questions:  (1)  "What  techniques  can  be  used  to  iden- 
tify proxy  sets?"  and  (2)  "Do  the  techniques  correctly 
identify  proxy  sets?" 


PROXY-SET IDENTIFICATION METHODS

Variables that are collinear with one another are
members of the same proxy set. Therefore, the problem
of identifying proxy sets is essentially the problem
of identifying collinear variables. Often a variable
included  in  a  model  is  a  member  of  a  proxy  set  con- 
taining several  variables  that  do  not  appear  in  the 
model.  The  point  we  emphasize  is  that  these  proxy 
variables  that  are  not  in  the  final  model  must  also 
be  considered  when  interpreting  the  coefficients  that 
remain  in  the  model.  The  usual  methods  for  identify- 
ing collinearity  among  model  variables  after  the  model 
is  fitted  fail  to  consider  possible  proxy  variables  that 
did  not  make  it  into  the  model. 

Over  the  years,  several  methods  have  been  proposed 
for diagnosing collinearity in a fitted model. Some of
these  methods  can  also  be  used  to  identify  proxy  sets 
containing  proxies  that  did  not  make  it  into  the  final 
model.  We  will  discuss  several  of  these  methods  and 
will  introduce  a  particularly  powerful  method  that 
has not appeared previously in the literature. In all
cases,  proxy  sets  are  to  be  identified  before  fitting  the 
model.  Thus,  the  methods  we  propose  will  consider 
variables  that  never  become  part  of  the  final  model. 

The effectiveness of each of the proposed methods
depends on the decision criteria the user employs.
These criteria take the form of numerical limits or
cutoffs which, when exceeded, indicate the presence
of proxies. Unfortunately, there is no firm theoretical
reason for choosing specific cutoffs. We agree with
Baskerville and Toogood (1982) that "it is difficult and
perhaps inappropriate to give general rules since a
preliminary exploratory analysis should be flexible."
Nevertheless, the identification of proxy sets demands
that we choose some cutoffs. We have attempted to
follow recommendations in the literature when possible,
but in cases where no recommendations can be found,
we suggest some numerical cutoffs we find useful.
In fact, the cutoffs we suggest are the ones we
used to obtain the results presented in this article.
Our suggested cutoffs appear at the end of the
description of each method. The cutoffs we did not find
in the literature were determined during our research
when we knew, by careful construction of the data
sets, which variables belonged to proxy sets. The
cutoffs we suggest are those that gave the specific
identification method a fair opportunity of identifying the
proxy sets we knew to be correct. In some cases, we
chose cutoffs based on other considerations.
Nevertheless, our suggested cutoffs are strictly empirical,
and users of these methods may wish to select their
own numerical cutoffs. More details on our choices
of cutoffs are found in appendix A.

The Correlation Matrix Method

Examining  the  elements  of  the  correlation  matrix, 
R,  is  probably  the  oldest  method  for  detecting  linear 
relationships  among  variables.  It  has  been  used  ex- 
tensively. Its  main  limitation  is  its  inability  to  detect 
relationships  between  more  than  two  variables  at  a 
time.  Inspection  of  correlation  coefficients  often  fails 
to  detect  relationships  involving  several  variables,  es- 
pecially when  all  relationships  between  pairs  of  vari- 
ables are  fairly  weak.  Another  shortcoming  is  the  dif- 
ficulty of  keeping  track  of  many  variables  at  once.  On 
the  other  hand,  the  correlation  matrix  method  (CORR) 
is  easy  to  use  and  is  generally  available.  In  practice, 
users  simply  look  for  large  positive  or  negative  corre- 
lation coefficients. 

If  the  absolute  value  of  a  correlation  coefficient  was 
greater  than  0.5,  we  considered  the  two  variables  to 
be  proxies  of  one  another,  belonging  to  the  same  proxy 
set.  Likewise,  if  the  absolute  values  of  all  correlation 
coefficients  involving  a  specific  variable  were  less  than 
0.32,  we  considered  that  variable  to  be  a  nonproxy. 
Variables  with  correlation  coefficients  between  0.32 
and  0.5  were  not  specified  as  proxies  or  as  nonproxies. 

We  cannot  identify  proxy  sets  of  more  than  two  var- 
iables by  using  CORR  alone.  However,  we  established 
a  rule  that  if  CORR  identified  all  lower-order  proxy 
sets,  we  considered  it  to  have  identified  the  higher- 
order  proxy  set  as  well.  For  example,  a  three-variable 
proxy  set  might  involve  variables  x,  y,  and  z.  If  the 
proxy sets {x,y}, {x,z}, and {y,z} are all identified, we
considered the three-variable set {x,y,z} to be
identified also.
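
The CORR decision rules above can be sketched in a few lines of Python. This is only an illustration, assuming NumPy is available. The function names `corr_proxy_pairs` and `higher_order_sets` and their parameters are hypothetical; only the cutoffs 0.5 and 0.32 and the lower-order rule come from the text.

```python
import numpy as np
from itertools import combinations

def corr_proxy_pairs(X, names, proxy_cut=0.5, nonproxy_cut=0.32):
    """Apply the pairwise cutoffs to the correlation matrix of the columns of X."""
    R = np.corrcoef(X, rowvar=False)
    p = len(names)
    # Two variables are proxies of one another when |r| exceeds proxy_cut.
    pairs = {frozenset((names[i], names[j]))
             for i, j in combinations(range(p), 2)
             if abs(R[i, j]) > proxy_cut}
    # A variable is a nonproxy when all of its |r| values fall below nonproxy_cut.
    nonproxies = [names[i] for i in range(p)
                  if all(abs(R[i, j]) < nonproxy_cut for j in range(p) if j != i)]
    return pairs, nonproxies

def higher_order_sets(pairs, names):
    """A set of three or more variables counts as identified when every pair
    inside it was identified."""
    found = []
    for k in range(3, len(names) + 1):
        for combo in combinations(names, k):
            if all(frozenset(pr) in pairs for pr in combinations(combo, 2)):
                found.append(set(combo))
    return found
```

Variables that fall in neither output (correlations between 0.32 and 0.5) are left unclassified, as in the text.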

Iterative  Variance  Inflation  Factor 
Method 

Variance inflation factors (VIF's) are the diagonal
elements of $R^{-1}$, the inverse of the correlation matrix
(Belsley 1991, pp. 27-28). Here we introduce, for the
first time, an algorithm based on a modification of the
VIF. This algorithm is particularly useful for
identifying proxy sets, but it can also be used during the
fitting stage of regression model-building. When used
in the latter capacity, it permits the user to determine
which sets of explanatory variables are collinear with
one another. Thus, it aids the user in building
models with "relatively unrelated" explanatory variables.

The VIF is usually applied in a static manner. Then
it is capable only of identifying which explanatory
variables are involved in collinearities. It does not specify
which variables are collinear with which other
variables. A VIF-based method capable of helping us
identify the specific groups of variables that are
collinear with one another would be even more useful.
Such a method would have to use the VIF in a dynamic,
iterative manner. We present the details of such a
method, the iterative variance inflation factor (IVIF)
method, in appendix B. In general, the method is
based on entering variables one at a time, so the
behavior of the entire set of VIF's is evaluated.

If the VIF of any variable (already in the model)
jumped to a value greater than 1.5 when a new
variable was introduced, we considered it to be a proxy
of the newly entered variable. Those variables with
VIF's near 1 can be considered nonproxy variables.
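
The full IVIF algorithm is given in appendix B, which is not reproduced in this section, so the following Python sketch covers only the general idea stated above: enter variables one at a time and watch for VIF's that jump above the cutoff. It assumes NumPy is available. The function names, the order of entry (column order), and the `min_rise` tolerance used to decide what counts as a "jump" are assumptions made for illustration and may differ from the authors' procedure.

```python
import numpy as np

def vifs(X):
    """VIF's as the diagonal of the inverse correlation matrix of the columns of X."""
    if X.shape[1] == 1:
        return np.array([1.0])          # a single variable has VIF 1 by definition
    R = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(R))

def iterative_vif(X, names, cutoff=1.5, min_rise=0.1):
    """Enter columns one at a time; report (earlier variable, new variable) links.

    A link is reported when the earlier variable's VIF exceeds `cutoff` after
    the new variable enters and has risen by more than `min_rise` (an assumed
    tolerance, not taken from the text).
    """
    entered = [0]
    prev = {0: 1.0}
    links = []
    for new in range(1, X.shape[1]):
        entered.append(new)
        current = dict(zip(entered, vifs(X[:, entered])))
        for old in entered[:-1]:
            if current[old] > cutoff and current[old] - prev[old] > min_rise:
                links.append((names[old], names[new]))
        prev = current
    return links
```

Each reported link names a variable already entered and the newly entered variable whose arrival inflated its VIF. Variables that never appear in a link and whose VIF's stay near 1 would be treated as nonproxies.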

Variance  Decomposition  Method 

Variance  decomposition  (VDC)  is  a  regression  diag- 
nostics technique  that  is  becoming  more  readily  avail- 
able in  statistical  analysis  computer  programs  each 
year.  It  has  proven  itself  to  be  of  considerable  value 
in  detecting  collinearity.  As  described  in  Belsley  and 
others  (1980),  VDC  specifically  identifies  those  vari- 
ables that  are  linearly  related  to  one  another. 

VDC  expresses  each  variance  component  as  a  pro- 
portion of  the  total  variance  for  a  given  regression 
coefficient.  Therefore,  the  total  of  all  the  proportions 
for  a  given  regression  coefficient  will  equal  1.  This 
method  can  readily  identify  proxy  sets  containing  more 
than  two  variables. 

Following  the  suggestion  of  Belsley  and  others  (1980) 
for  identification  of  collinear  variables,  when  two  or 
more  proportions  corresponding  to  the  same  eigenvalue 
were  greater  than  0.5,  we  considered  the  correspond- 
ing variables  to  be  proxies.  If  the  variance  proportions 
were  all  less  than  0.5,  we  considered  the  corresponding 
variable  to  be  a  nonproxy.  Thus,  if  there  were  four 
variables  with  proportions  greater  than  0.5  for  the 
same  eigenvalue,  all  four  were  considered  members 
of  the  same  proxy  set. 
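
As a rough illustration of the 0.5 rule, the sketch below computes variance proportions from the eigen-decomposition of the correlation matrix and groups variables that share a large proportion on the same eigenvalue. It assumes NumPy is available. Working from the correlation matrix (centered, standardized variables, no intercept) is a simplifying assumption; Belsley and others (1980) work from a column-scaled data matrix. The function names are hypothetical, and treating every variable outside a detected set as a nonproxy is this sketch's own convention.

```python
import numpy as np

def variance_proportions(X):
    """Return eigenvalues of R (ascending) and a proportions matrix whose rows
    are eigenvalues and whose columns are variables; each column sums to 1."""
    R = np.corrcoef(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(R)      # eigvecs[:, j] pairs with eigvals[j]
    phi = (eigvecs ** 2) / eigvals            # phi[k, j] = v_kj^2 / lambda_j
    props = phi / phi.sum(axis=1, keepdims=True)
    return eigvals, props.T

def vdc_proxy_sets(X, names, cutoff=0.5):
    """Group variables with two or more proportions above `cutoff` on one eigenvalue."""
    _, props = variance_proportions(X)
    proxy_sets = []
    for row in props:                         # one row per eigenvalue
        members = [names[k] for k in range(len(names)) if row[k] > cutoff]
        if len(members) >= 2:
            proxy_sets.append(set(members))
    in_sets = set().union(*proxy_sets)
    nonproxies = [v for v in names if v not in in_sets]
    return proxy_sets, nonproxies
```

The row sums of `phi` equal the VIF's, so the proportions show how each variable's variance inflation is distributed across the eigenvalues.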

Principal  Components  Methods 

The use of principal components analysis (PC) is not
new  in  natural  resources  (Isebrands  and  Crow  1975). 
However,  the  usual  applications  emphasize  the  prin- 
cipal components  corresponding  to  the  largest  eigen- 
values. These  principal  components  leave  the  least 
unexplained  variability.  In  contrast,  we  suggest  fo- 
cusing on  the  principal  components  associated  with 
the  smallest  eigenvalues  in  order  to  identify  proxy 
sets.  Small  eigenvalues  are  associated  with  linearly 
related  variables;  therefore,  smaller  eigenvalues  re- 
sult from  collinearity. 

Principal components can be rotated by applying a
linear  transformation  to  them.  Sometimes,  this  type 
of  rotation  can  help  us  interpret  the  components. 
Many  rotations  could  be  performed,  but  we  consider 
only  the  Varimax  rotation,  which  leaves  the  rotated 
principal  components  (RPC)  orthogonal  to  one  another. 
Among the many sources of additional information
on PC and RPC are Morrison (1967) and Dillon and
Goldstein (1984). Latent root regression (Hawkins
1973; Sharma and James 1981; Webster and others
1974) is based on methods closely related to our use
of principal components. Baskerville and Toogood
(1982) propose a classification system for variables
and suggest use of latent root regression as part of
a whole model-building philosophy they call "guided
regression."

Without  rotation,  one  merely  examines  the  coeffi- 
cients within  each  principal  component  and  looks  for 
possible  groupings.  After  rotation,  one  looks  at  the 
coefficients  within  a  given  principal  component;  any 
variables with a coefficient greater than 0.32 are
considered members of the same proxy set. If either PC
or RPC identified only a single variable, it was defined
to be a nonproxy.
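
The idea of inspecting the components with the smallest eigenvalues can be sketched as follows, assuming NumPy is available. This sketch covers unrotated components only; Varimax rotation is not implemented. Applying the 0.32 cutoff to the absolute value of unrotated coefficients, and the `eig_cutoff` threshold that decides which eigenvalues count as "small," are assumptions made for illustration. The function name is hypothetical.

```python
import numpy as np

def small_eigenvalue_groups(X, names, coef_cutoff=0.32, eig_cutoff=0.5):
    """List, for each small-eigenvalue component of R, the variables whose
    coefficients exceed `coef_cutoff` in absolute value."""
    R = np.corrcoef(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(R)       # eigenvalues in ascending order
    groups = []
    for lam, vec in zip(eigvals, eigvecs.T):   # vec holds one component's coefficients
        if lam >= eig_cutoff:
            continue                           # skip components with large eigenvalues
        members = [names[k] for k in range(len(names)) if abs(vec[k]) > coef_cutoff]
        groups.append((float(lam), members))
    return groups
```

A component that lists two or more variables suggests a candidate proxy set. A component that lists only one variable would, under the rule in the text, mark that variable as a nonproxy.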


Figure 1. Graphic representation of the structure of
the four data sets (A, B, C, and D) used in comparing the
efficiencies of the seven proxy-set identification
methods. There are 10 variables in each data set.
Variable $X_1$ is represented by 1, $X_2$ is represented
by 2, etc. Variables in the same proxy set are enclosed
in the same ellipse.


Factor  Analysis  Method 

Most  factor  analyses  are  performed  on  the  correla- 
tion matrix,  R  (Dillon  and  Goldstein  1984,  pp.  63-68). 
We  recommend  doing  so  when  identifying  proxy  sets. 
A factor analysis is performed on all the variables that
are candidates for inclusion in the regression model.

As with PC, factor analysis (FA) also allows rotation
of the axes to permit better identification of some
aspects of the data. When using factor analysis to
identify members of a proxy set, the first factors,
corresponding to the largest eigenvalues, are most likely
to produce usable proxy sets. Within a given factor,
any variables with a loading greater than 0.32 were
considered members of the same proxy set. We only
recommend the use of FA with the Varimax rotation.

Cluster  Analysis  Method 

Cluster  analysis  (CLUSTER)  is  not  a  single  method, 
but a collection of methods (Hartigan 1975; Romesburg
1984).  These  methods  are  commonly  used  in  their 
traditional  way  in  natural  resources  research  (Turner 
1974).  Their  traditional  use  is  to  attempt  to  group,  or 
"cluster,"  individuals  on  the  basis  of  similarities  in  a 
set  of  measurements  made  on  each  individual.  Using 
CLUSTER  to  identify  proxy  sets  requires  exchanging 
the  roles  of  the  individuals  and  the  variables.  In  other 
words,  the  variables  are  grouped,  or  "clustered,"  on 
the  basis  of  their  response  on  the  various  individuals. 
We  always  use  the  hierarchical  method  of  clustering. 
Cluster  analysis  output  identifies  proxy  sets  directly. 
If  a  variable  was  not  found  in  a  proxy  set  (that  is, 
it  was  not  found  in  a  cluster),  it  was  considered  a 
nonproxy. 
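
Clustering variables rather than individuals can be sketched as below, assuming NumPy is available. The text specifies only that a hierarchical method was used. The dissimilarity 1 - |r|, average linkage, and the stopping distance `max_dist` are assumptions chosen for illustration, and the function name is hypothetical.

```python
import numpy as np

def cluster_variables(X, names, max_dist=0.5):
    """Agglomerative (average-linkage) clustering of the columns of X,
    using 1 - |r| as the dissimilarity between variables."""
    D = 1.0 - np.abs(np.corrcoef(X, rowvar=False))
    clusters = [[i] for i in range(len(names))]   # start with one variable per cluster

    def avg_dist(a, b):
        return float(np.mean([D[i, j] for i in a for j in b]))

    while len(clusters) > 1:
        candidates = [(avg_dist(a, b), ia, ib)
                      for ia, a in enumerate(clusters)
                      for ib, b in enumerate(clusters) if ia < ib]
        d, ia, ib = min(candidates)               # closest pair of clusters
        if d > max_dist:
            break                                 # remaining clusters are too far apart
        merged = clusters[ia] + clusters[ib]
        clusters = [c for k, c in enumerate(clusters) if k not in (ia, ib)] + [merged]

    proxy_sets = [sorted(names[i] for i in c) for c in clusters if len(c) > 1]
    nonproxies = [names[c[0]] for c in clusters if len(c) == 1]
    return proxy_sets, nonproxies
```

Clusters with two or more variables are read directly as proxy sets, and variables left in singleton clusters are treated as nonproxies.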

STUDY  DESIGN 

This  study  was  designed  to  determine  the  effective- 
ness of  seven  identification  techniques  used  in  proxy- 
set  identification.  We  define  effectiveness  as  the  per- 
centage of  existing  proxy  sets  identified  correctly  by 
an  identification  technique. 

The effectiveness of an identification technique can
only be measured if we know which proxy sets really
exist. Otherwise, there is no basis for comparison.
Therefore, all seven techniques were applied to each
of four different data sets. Each data set was
generated to contain several known proxy sets. Proxy sets
had one, two, three, or four variables. Each of the four
data sets consisted of 100 observations on 10 variables.
The proxy-set structures are illustrated in figure 1,
where variable $X_1$ is indicated by the number 1, $X_2$ by
2, and so forth. All variables with identification
numbers enclosed in the same ellipse are members of the
same proxy set. For example, there are three proxy
sets in data set A: two sets with three members each,
and one set with two members. In addition, there are
two nonproxy variables.

We devised a scoring system to compare the different
methods' efficiencies in identifying proxy sets
correctly. To develop the score, all possible subsets of
variables within each true proxy set were enumerated.
For example, a three-variable proxy set in data set A,
denoted here by $\{X_i, X_j, X_k\}$, has four subsets of
interest: $\{X_i, X_j\}$, $\{X_i, X_k\}$, $\{X_j, X_k\}$, and
$\{X_i, X_j, X_k\}$. These are the subsets in which
a given variable appears correctly identified with at
least one other member of its true proxy set. We refer
to these subsets as "enumeration sets." Also, we refer
to a list of all these subsets, from all the true proxy sets
in a data set, as the "true enumeration list" for the data
set. All nonproxy variables are also included in the
enumeration list, because it is important to identify
them as well. As an example, data set A includes two
proxy sets that contain three variables each. Four
enumeration sets are obtained from each of these two
proxy sets (a total of eight enumeration sets). Also,
data set A includes a proxy set containing only two
variables. A single enumeration set is obtained from
this two-variable proxy set. Finally, there are two
nonproxy variables that contribute one variable each to
the enumeration list. Thus, the enumeration list
contains a total of 11 enumeration sets (8 + 1 + 1 + 1).
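
The construction of a true enumeration list can be checked with a short Python sketch. The function name `enumeration_list` and the variable labels are hypothetical; the labels stand in for the members of data set A, whose actual variable numbers appear only in figure 1.

```python
from itertools import combinations

def enumeration_list(proxy_sets, nonproxies):
    """All subsets of size two or more within each proxy set,
    plus one single-variable entry per nonproxy."""
    sets = []
    for ps in proxy_sets:
        members = sorted(ps)
        for k in range(2, len(members) + 1):
            sets.extend(frozenset(c) for c in combinations(members, k))
    sets.extend(frozenset([v]) for v in nonproxies)
    return sets

# Structure of data set A: two three-variable sets, one two-variable set,
# and two nonproxies (labels are placeholders).
data_set_a = enumeration_list(
    [{"a", "b", "c"}, {"d", "e", "f"}, {"g", "h"}],
    ["i", "j"],
)
print(len(data_set_a))  # 4 + 4 + 1 + 2 = 11
```

By the same counting, a four-variable proxy set contributes 6 pairs, 4 triples, and the full set, for 11 enumeration sets.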

The  variables  for  the  data  sets  were  generated  by 
one  of  the  authors.  A  different  author  who  did  not 
know  which  variables  belonged  to  proxy  sets  applied 
the  seven  identification  techniques. 

For example, each of the identification methods was
applied to data set A. A separate enumeration list was
prepared for each of the methods. Each list was
compiled from the proxy sets identified by the
corresponding method. A method scored one point for each set
in its enumeration list corresponding to a set in the
true enumeration list of data set A. The rationale of
the scoring system is that a point is given for every
partially correct proxy-set identification. The scoring
system simply gives the identification method a score
equal to the percentage of enumeration sets in data
set A that were correctly found when the technique was
applied. Data sets B, C, and D were evaluated in the
same way.
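
The score itself reduces to a set intersection. The sketch below is a minimal illustration; the function name `effectiveness` and the placeholder labels are hypothetical, and both lists are assumed to hold enumeration sets represented as frozensets.

```python
def effectiveness(true_list, found_list):
    """Percentage of true enumeration sets that also appear in a method's list."""
    true_sets = set(true_list)
    hits = true_sets & set(found_list)       # one point per matching set
    return 100.0 * len(hits) / len(true_sets)

# Placeholder example: a method recovers one of two true enumeration sets
# and also reports a set that is not in the true list (which earns no point).
true_list = [frozenset({"a", "b"}), frozenset({"c"})]
found_list = [frozenset({"a", "b"}), frozenset({"c", "d"})]
print(effectiveness(true_list, found_list))  # 50.0
```

Under this rule a method that matched 7 of the 11 enumeration sets of data set A would score 63.6 percent.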

The  proxy  sets  generated  for  this  study  contained 
variables  having  known  linear  relationships  with 
one  another  that  ranged  from  weak  to  strong.  There- 
fore, some  measure  of  the  strength  of  relationships 
within  a  proxy  set  seemed  useful.  We  chose  the  mean 
of  the  absolute  values  of  the  correlation  coefficients 
for  all  possible  pairs  of  variables  within  a  proxy  set. 
We call this measure the "average correlation." As an
example, for a proxy set containing $X_i$, $X_j$, and $X_k$ we
can compute three correlation coefficients: between
$X_i$ and $X_j$, between $X_i$ and $X_k$, and between $X_j$
and $X_k$. Our measure of strength for the relationships
among the variables in this proxy set is the mean of the
absolute values of these three correlation coefficients.
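
The average correlation can be computed directly from the upper triangle of the correlation matrix. This is a minimal sketch, assuming NumPy is available; the function name is hypothetical, and `X` is assumed to hold one column per member of the proxy set.

```python
import numpy as np

def average_correlation(X):
    """Mean of |r| over all distinct pairs of columns of X."""
    R = np.corrcoef(X, rowvar=False)
    upper = np.triu_indices_from(R, k=1)     # each pair counted once
    return float(np.mean(np.abs(R[upper])))
```

For a three-variable proxy set the mean is taken over three coefficients, and for a four-variable set over six.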

Logistic  regression  (Hosmer  and  Lemeshow  1989; 
McCullagh and Nelder 1989) was used to describe
two  relationships:  (1)  between  the  proportion  of  enu- 
meration sets  identified  correctly  and  average  corre- 
lation and  (2)  between  the  proportion  of  enumeration 
sets  identified  correctly  and  the  number  of  variables 
in  the  proxy  set.  For  this  analysis,  nonproxies  were 
not  used. 

RESULTS 

We  computed  the  effectiveness  of  the  seven  meth- 
ods for  identifying  proxy  sets,  based  on  the  four  data 
sets  generated  with  100  observations  each.  The  four 
data  sets  studied  included  23  known  associations:  one 
four-variable proxy set, four three-variable proxy sets,
six  two-variable  proxy  sets,  and  12  nonproxy  variables. 
From these proxy sets, we constructed the enumeration
sets defined earlier. From data set A we obtained
11  enumeration  sets,  from  data  set  B  we  obtained  9 
sets,  from  data  set  C  we  obtained  16  sets,  and  from 
data  set  D  we  obtained  9  sets.  This  gives  us  a  total 
of  45  enumeration  sets.  We  define  overall  effective- 
ness to  be  the  percent  of  the  enumeration  sets  that 
were  correctly  identified  out  of  the  total  of  45  possible. 

Table  1  shows  substantial  differences  in  effective- 
ness for  the  seven  proxy-set  identification  techniques. 
The  "Total"  column  provides  an  overall  measure  of  each 
method's  effectiveness  in  correctly  identifying  proxy 
sets.  A  method  that  identified  all  proxy  sets  correctly 
would have a score of 100 percent. In this study,
effectiveness scores ranged from a high of about 91
percent for the IVIF method down to 47 percent for FA.

Our results are not extensive enough to differentiate
clearly among methods that perform about equally
well. Therefore, to provide practical guidelines, we
chose to place the seven methods in four groups based
on their effectiveness scores. The IVIF method stands
alone with a score of 91.1 percent. The VDC method
had the second highest effectiveness score (71.1
percent). CORR (57.8 percent), PC (60 percent), and RPC
(60 percent) form the third group, identifying an
average of 59.3 percent of the enumeration sets correctly.
The final group consists of CLUSTER (51.1 percent) and
FA (46.7 percent), for an average of only 48.9 percent
correct.

Ideally a proxy-set identification method should be
effective regardless of the composition of the data set.
A method should find a high percentage of proxy sets,
and it should identify different types of proxy sets.
Table 1 presents the effectiveness scores for each of
the seven methods when applied to each of the four
data sets: A, B, C, and D. Effectiveness scores for
individual combinations (of method and data set) range
from several perfect scores (100 percent) for the IVIF
and VDC methods down to 11 percent for the PC
method applied to data set D. However, most methods
had difficulty finding the correct proxy sets in data
set D, which contained some rather weak relationships
within the proxy sets. Data set C, which included the
only proxy set containing four variables, also presented
problems for several of the methods.

Table 1. Effectiveness scores (percent correct) of seven proxy-set identification methods

                  Data set                         Proxy-set size
Method¹       A      B      C      D  Nonproxies        Two      Three       Four   Total
                                                  variables  variables  variables
CORR       63.6   77.8   43.8   55.6       100.0       58.3          0          0    57.8
IVIF      100.0  100.0  100.0   55.6       100.0       87.5       87.5      100.0    91.1
VDC        63.6   44.4  100.0   55.6       100.0       58.3       62.5      100.0    71.1
PC         81.8   55.6   75.0   11.1         8.3       75.0       87.5      100.0    60.0
RPC        63.6   77.8   50.0   55.6       100.0       62.5          0          0    60.0
FA         45.5   55.6   25.0   77.8        16.7       75.0       12.5          0    46.7
CLUSTER    63.6   66.7   31.3   55.6         8.3       75.0       50.0          0    51.1

¹CORR = the correlation matrix method; IVIF = the iterative variance inflation factor method; VDC = the variance decomposition method; PC =
the principal components method; RPC = the rotated principal components method; FA = the factor analysis method; CLUSTER = the cluster
analysis method.

The  four  data  sets  varied  in  their  composition  of 
one-,  two-,  three-,  and  four-variable  proxy  sets.  A 
method  that  identifies  complex  proxy  sets  should  per- 
form better  than  other  methods  on  data  consisting  of 
complex  proxy  sets.  However,  not  all  proxy  sets  are 
complex.  The  different  sizes  of  proxy  sets  in  data  sets 
A,  B,  C,  and  D  provide  the  variability  needed  to  indi- 
cate a  particular  method's  performance  under  differ- 
ent proxy-set  structures. 

Effectiveness  scores  varied  greatly  based  on  the  size 
of  the  proxy  set  (table  1).  Most  methods  identified 
two-variable proxy sets correctly, with scores ranging
from  58.3  percent  to  87.5  percent  effective.  If  readers 
inspect  only  the  two-variable  column  of  table  1,  they 
might  conclude  that  the  techniques  are  all  relatively 
successful.  However,  it  is  only  at  the  two-variable 
proxy-set  size  where  this  conclusion  can  be  reached. 
For  larger  proxy  sets,  effectiveness  ranged  from  0  to 
100  percent. 

The IVIF and PC methods did a particularly good
job  of  identifying  the  three-  and  four-variable  proxy 
sets.  Although  VDC  identified  the  four-variable  set, 
it  had  some  difficulty  with  the  three-variable  sets. 
The  CLUSTER  method  did  a  fair  job  of  identifying 
the  three-variable  proxy  sets,  but  missed  the  single 
four-variable  proxy  set  entirely.  The  CORR  and  RPC 
methods did not find any of the three- or four-variable
proxy sets. Also, the FA method failed to find the
four-variable set and did poorly on the three-variable
proxy sets as well.


The  CORR,  IVIF,  VDC,  and  RPC  methods  found  all 
nonproxy  variables,  but  the  PC,  FA,  and  CLUSTER 
methods  missed  all  but  a  few  of  them.  Although  the 
PC  method  had  high  scores  for  the  three-  and  four- 
variable  proxy  sets,  it  failed  to  find  most  of  the  non- 
proxy  variables,  which  might  be  thought  of  as  simple 
proxy  sets.  Nevertheless,  nonproxy  variables  are  very 
important  when  interpreting  regression  coefficients, 
because  correct  identification  of  the  nonproxy  variables 
permits  relatively  clear  interpretation  of  the  regres- 
sion coefficients  associated  with  them. 

We  modeled  the  proportion  of  enumeration  sets 
identified  correctly,  using  both  "average  correlation" 
and  "number  of  variables  in  the  proxy  set"  as  the  ex- 
planatory variables  in  a  logistic  regression.  The  re- 
sults are  presented  in  table  2.  The  fitted  equations 
are  plotted  in  figure  2. 

The IVIF, VDC, and PC methods had fairly large
probabilities  associated  with  the  fitted  slope  when 
average  correlation  was  used  as  a  predictor  (table  2). 
This  indicates  lack  of  conclusive  evidence  of  a  relation- 
ship between  proportion  of  enumeration  sets  identi- 
fied correctly  and  average  correlation  (a  measure  of 
the  strength  of  relationships  among  variables  in  the 
same  proxy  set).  This  is  shown  graphically  in  figure  2 
by  relatively  flat  curves  for  these  three  methods. 

The  other  four  proxy-set  identification  methods  dis- 
play strong  relationships  between  the  proportion  of 
enumeration  sets  identified  and  average  correlation. 

Only  the  FA,  RPC,  and  CLUSTER  methods  have 
probabilities  small  enough  to  indicate  detectable  rela- 
tionships between  the  proportion  of  enumeration  sets 
identified  correctly  and  the  number  of  variables  in  a 
proxy  set  (table  2).  Figure  3  contains  a  graphical  rep- 
resentation of  the  fitted  relationships.  Of  course,  the 
failure  to  find  such  a  relationship  may  be  the  result 
of too little data. We had only a single four-variable
proxy  set. 
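
The fitted curves plotted in figures 2 and 3 can be recomputed from the intercepts and slopes in table 2. The sketch below assumes the conventional logit form, p = 1 / (1 + exp(-(intercept + slope * x))); the link function is not printed in this section, and the function name is ours, chosen for illustration.

```python
import math


def fitted_proportion(intercept, slope, x):
    """Logistic curve p = 1 / (1 + exp(-(intercept + slope * x))).

    Assumes the usual logit link; x is average correlation here.
    """
    return 1.0 / (1.0 + math.exp(-(intercept + slope * x)))


# Average-correlation intercepts and slopes as printed in table 2.
coefficients = {"IVIF": (1.7, 0.6), "RPC": (-12.6, 24.3)}

for method, (a, b) in coefficients.items():
    curve = [round(fitted_proportion(a, b, r), 3) for r in (0.3, 0.6, 0.9)]
    print(method, curve)
```

Under this assumption the IVIF curve stays nearly level across the range of average correlation, while the RPC curve climbs steeply from near zero, which agrees with the description of figure 2 in the text.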
Table 2 - Results of logistic regression fitting of proportion of proxy sets identified correctly on average correlation and on
number of variables in the proxy set

               Average correlation model       Number of variables in proxy set model
Method¹     Intercept    Slope   P-value²      Intercept    Slope   P-value²
CORR        (row garbled in source; surviving values: 1.4, 0.2, 0.8286)
IVIF              1.7       .6      .7980            1.4       .2      .8286
VDC                .2       .6      .7321            -.6       .5      .5248
PC                2.7     -3.0      .1724           -1.4      1.2      .2909
RPC             -12.6     24.3      .0076            3.8     -1.3      .0776
FA               -5.3     11.7      .0037            6.6     -2.9      .0127
CLUSTER          -1.4      4.7      .0212            3.8     -1.3      .0776

¹CORR = the correlation matrix method; IVIF = the iterative variance inflation factor method; VDC = the variance decomposition method;
PC = the principal components method; RPC = the rotated principal components method; FA = the factor analysis method; CLUSTER = the
cluster analysis method.

²P-value associated with the test on the slope of the logistic regression.


DISCUSSION 

The  methods  used  to  identify  proxy  sets  are  valuable 
because  they  flag  potential  problems  when  interpret- 
ing multiple  linear  regression  coefficients.  An  ideal 
method  would  have  to  be  totally  effective  (identifying 
every  real  proxy  correctly).  Unfortunately,  none  of 
the  methods  we  evaluated  met  that  standard. 

Any  one  of  the  seven  methods  could  be  useful  if  it 
were  the  only  one  available.  Therefore,  we  do  not  rec- 
ommend discarding  any  of  the  seven  methods.  How- 
ever, some  of  the  methods  are  of  limited  value  in 


identifying  proxy  sets  containing  weakly  related  vari- 
ables. The  data  analyst  should  use  the  best  technique 
available. 

When  interpreting  regression  coefficients,  it  would 
be  a  serious  error  to  treat  a  variable  that  is  a  member 
of  a  proxy  set  as  if  it  were  a  nonproxy,  independent 
of  all  other  variables.  Without  any  analysis  to  detect 
proxy  sets,  all  variables  would  be  regarded  as  non- 
proxies  and  would  be  interpreted  accordingly.  We  feel 
analysis  with  any  of  the  seven  methods  is  better  than 
no  analysis  at  all,  but  some  caution  must  be  exercised 
when  using  any  of  the  methods. 


Figure 2 - Plotted relationships from fitted
logistic regressions of proportion of proxy
sets correctly identified on average correlation
(horizontal axis, 0 to 1), for seven identification
methods: the correlation matrix method (CORR),
the iterative variance inflation factor method (IVIF),
the variance decomposition method (VDC),
the principal components method (PC), the
rotated principal components method (RPC),
the factor analysis method (FA), and the
cluster analysis method (CLUSTER).

Figure 3 - Plotted relationships from fitted
logistic regressions of proportion of proxy sets
correctly identified on number of variables in
the proxy set (horizontal axis), for seven
identification methods: the correlation matrix
method (CORR), the iterative variance inflation
factor method (IVIF), the variance decomposition
method (VDC), the principal components method (PC),
the rotated principal components method (RPC),
the factor analysis method (FA), and the cluster
analysis method (CLUSTER).




Consider  the  explanatory  variables,  WETLEN  and 
DRYLEN,  addressed  earlier.  We  began  by  discussing 
a  situation  with  both  variables  in  the  model,  where 
neither  was  statistically  significant.  If  either  one  were 
included  without  the  other,  however,  the  remaining 
variable  was  highly  significant.  Analysis  with  IVIF 
indicated  that  these  variables  were  members  of  the 
same proxy set, which we might call "length." Traditional
diagnostics on a final regression model containing
either of these variables may well have concluded
"no collinearity problem," leaving the analyst free to
interpret  the  coefficient  without  restraint.  But  tradi- 
tional collinearity  diagnostics  fail  to  detect  proxies 
that  are  not  included  in  the  model.  Most  of  the  seven 
proxy-set  identification  methods  we  used  identified 
WETLEN  and  DRYLEN  as  members  of  the  same  proxy 
set.  Such  identification  of  an  existing  proxy  set  should 
cause  us  to  exercise  caution  when  interpreting  the 
variables  that  are  included  in  the  model  and  to  con- 
sider their  relationships  with  their  proxies  that  are 
not  included  in  the  model. 

Analysis  of  proxy-set  membership  should  take  place 
before  attempting  to  build  the  model,  and  should  be- 
come an  integral  part  of  the  overall  model-building 
process. For example, knowledge of proxy-set membership
might permit an analyst to include only a single
member  of  each  proxy  set  in  the  model.  In  fact,  this 
may  be  the  most  common  course  to  follow.  However, 
not  all  variables  in  a  proxy  set  perform  equally  well 
in  a  regression  model  because  each  variable  probably 
conveys  only  part  of  the  information  conveyed  by  the 
other  members  of  the  same  set;  therefore,  the  choice 
of  which  member  (or  members)  of  the  proxy  set  to  in- 
clude in  the  model  is  an  important  one.  Such  a  choice 
may  have  to  be  based  on  non-statistical  considerations, 
such  as  the  availability  of  measurements,  the  cost  of 
obtaining measurements, and so forth. If it is necessary
to include more than one variable from the same
proxy  set  in  a  regression  model,  the  analyst  may  wish 
to  use  a  biased  regression  procedure  such  as  ridge  re- 
gression, first  proposed  by  Hoerl  (1962).  Further  work 
by  Hoerl  and  Kennard  (1970)  and  work  in  forestry  by 
Bare  and  Hann  (1981)  make  this  technique  a  useful 
tool  once  the  proxy  sets  have  been  identified. 
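
As a rough illustration of why a biased procedure can help when two members of one proxy set must stay in the model, the sketch below solves the ridge equations of Hoerl and Kennard (1970), (X'X + kI)b = X'y, for two standardized predictors, so that X'X is their correlation matrix. The correlations and the value of k are hypothetical, chosen only to show the effect, and the function name is ours.

```python
def ridge_two_predictors(r12, r1y, r2y, k):
    """Ridge coefficients for two standardized predictors.

    Solves [[1 + k, r12], [r12, 1 + k]] b = [r1y, r2y].
    k = 0 reproduces ordinary least squares.
    """
    d = 1.0 + k
    det = d * d - r12 * r12
    b1 = (d * r1y - r12 * r2y) / det
    b2 = (d * r2y - r12 * r1y) / det
    return b1, b2


# Hypothetical correlations: two strongly related proxies (r12 = 0.95)
# that are about equally related to the response.
ols = ridge_two_predictors(0.95, 0.80, 0.78, k=0.0)
ridge = ridge_two_predictors(0.95, 0.80, 0.78, k=0.1)
print("least squares:", [round(b, 3) for b in ols])
print("ridge, k=0.1: ", [round(b, 3) for b in ridge])
```

With these made-up inputs the least squares coefficients differ widely even though the two predictors are nearly interchangeable, and a small k pulls them toward each other. The choice of k is a separate problem not addressed here.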

Of  course,  once  the  coefficients  have  been  estimated, 
the  analyst  can  only  test  for  collinearity  among  those 
variables  that  were  included  in  the  final  model.  Though 
important,  that  type  of  postfitting  diagnostic  analysis 
cannot  provide  insight  beyond  those  variables  that 
were included in the final version. Traditional collinearity
diagnostic procedures, applied to the final
model,  are  totally  ineffectual  in  identifying  proxy-set 
variables  that  are  not  in  the  model.  Yet,  variables  that 
are  not  in  the  model  are  often  implicitly  interpreted 
as  having  no  importance  in  understanding  the  behav- 
ior of  the  response  variable,  an  interpretation  that 
may  be  totally  incorrect. 


This  article  focuses  on  proxy  sets  as  they  affect  the 
interpretation  of  regression  coefficients.  Proxy-set 
membership  should  be  analyzed  whenever  regression 
coefficients  are  interpreted  or  the  explanatory  variable 
of concern is a policy-related variable intended to be
manipulated.  For  example,  econometric  analyses  (such 
as  supply-demand  modeling)  commonly  interpret  co- 
efficients and  then  make  policy  recommendations  based 
on  the  interpretation.  Similarly,  growth  and  yield 
models  often  use  explanatory  variables  that  could  be 
members  of  proxy  sets,  but  are  sometimes  treated  as 
if  they  were  not.  The  obvious  danger  of  incorrectly 
acting  as  if  a  variable  were  a  nonproxy  is  that  result- 
ant policy  actions  may  well  fail,  or  "scientific  knowl- 
edge" allegedly  gained  may  be  misleading  or  erroneous. 

Figure  2  illustrates  the  strengths  and  weaknesses 
of  the  seven  proxy-set  identification  methods.  The  IVIF 
method  is  clearly  the  best  choice  because  it  correctly 
identifies  a  high  proportion  of  the  enumeration  sets, 
regardless  of  the  strength  of  relationships  within  the 
proxy  set  (as  measured  by  average  correlation).  The 
VDC  method  also  does  quite  well.  The  PC  method  does 
better  when  there  are  weak  relationships  in  the  proxy 
set  than  when  there  are  strong  ones.  The  CLUSTER 
method  does  better  identifying  strong  relationships 
than  weak  relationships.  The  FA,  RPC,  and  CORR 
methods  do  poorly  identifying  weak  relationships,  but 
well  identifying  strong  ones.  This  information  may 
help  when  selecting  a  method  that  might  be  suitable 
for  a  specific  application. 

The  IVIF  method  performs  well  over  the  entire  range 
of  average  correlation.  It  also  has  the  advantage  of 
being  widely  available  because  any  regression  program 
that  computes  VIF  can  be  used  interactively  to  produce 
the desired results. A step-by-step procedure for identifying
proxy sets and using them in interpretation of coefficients
in multiple linear regression is presented in
the  next  section. 

A  SUGGESTED  PROCEDURE 

We  recommend  the  following  7-step  procedure  for 
using  proxy  sets  to  help  interpret  the  coefficients  in 
multiple  linear  regression: 

1. Obtain data on all variables that are possible
candidates  for  inclusion  as  explanatory  variables  in 
the  final  multiple  linear  regression  model. 

2.  Identify  all  proxy  sets  among  the  candidate  vari- 
ables in  step  1. 

3. Choose one variable from each proxy set identified
in step 2. Each such variable is the initial representative
of its proxy set. This choice can be made on
practical grounds; we may wish to choose the variable
that is the most easily available, the most inexpensive,
or that is available in the most timely manner.

4. Use the customary model-building techniques
on the representatives of each proxy set along with
the nonproxy variables. This produces the basic
model.

5.  If  the  coefficient  of  the  representative  variable 
from  any  proxy  set  was  not  statistically  significant 
when  tried  in  the  model,  attempt  to  enter  a  different 
member  of  the  same  proxy  set  instead.  If  no  member 
variable  of  a  certain  proxy  set  produces  a  statistically 
significant  coefficient,  perhaps  that  proxy  set  does  not 
help  describe  the  response  variable  and  does  not  need 
to  be  represented  in  the  model.  Continue  this  proce- 
dure until  all  proxy  sets  or  nonproxy  variables  have 
had  a  chance  to  be  represented  in  the  model. 

6. Now try to enter additional variables from the
proxy sets to see if the fit improves. If a second (or
even third) variable from the same proxy set produces
a statistically significant coefficient, decide whether
this new variable should become a part of the model.
If you decide to include one or more variables of this
type, final interpretation of the coefficients must take
the duplicate nature of these variables into account.
As  a  duplicate  variable  is  introduced  into  the  model, 
all  its  proxies  that  were  already  in  the  model  appear 
less  important. 

7.  Consider  the  proxy  sets  when  interpreting  the 
coefficients  of  the  final  regression  model.  Each  vari- 
able in  the  model  represents  not  only  itself  but  also 
all  other  members  of  its  proxy  set.  Therefore,  the  ab- 
sence of  a  particular  variable  in  the  final  model  may 
not  mean  it  is  unimportant.  Its  absence  may  only 
mean  that  it  is  represented  in  the  model  by  a  proxy, 
another  member  of  the  same  proxy  set. 
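
The bookkeeping in steps 3, 4, and 7 can be sketched as follows. The proxy sets are the two found in the appendix B example; the measurement costs, the short list of nonproxy variables, and the function name are hypothetical and serve only to illustrate one practical way of choosing representatives.

```python
def choose_representatives(proxy_sets, cost):
    """Step 3: take the cheapest member of each proxy set.

    Returns {representative: [members it stands in for]}, which is
    the record needed in step 7 when coefficients are interpreted.
    """
    chosen = {}
    for members in proxy_sets:
        ordered = sorted(members, key=lambda v: (cost[v], v))
        chosen[ordered[0]] = ordered[1:]
    return chosen


proxy_sets = [
    {"stem length", "stem diameter", "leaf length"},
    {"growth rate", "time to 2-cm height"},
]
nonproxies = ["leaf density", "crown diameter", "percent sucrose"]

# Hypothetical relative costs of measuring each proxy-set member.
cost = {
    "stem length": 1, "stem diameter": 2, "leaf length": 3,
    "growth rate": 5, "time to 2-cm height": 4,
}

representatives = choose_representatives(proxy_sets, cost)
# Step 4: candidates handed to the usual model-building techniques.
candidates = sorted(representatives) + nonproxies
print(candidates)
print(representatives)
```

Keeping the mapping from each representative to the members it stands in for makes step 7 explicit: a coefficient on a representative is read as belonging to its whole proxy set.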
APPENDIX A: CHOOSING CUTOFF VALUES FOR IDENTIFYING PROXIES


With correlation coefficients and some other methods
we chose 0.32 (= √0.1) as a cutoff because it represents
an r² of only 10 percent. We felt that anything
smaller would not represent a relationship of sufficient
strength to be worth consideration. However, we did
not choose a cutoff value for any identification method
unless it produced about as good a classification into
proxy sets as was achievable with that method. Of
course, users of any of the methods discussed here may
wish to choose other numerical cutoff values that are
more meaningful to them.
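
A minimal sketch of how the correlation cutoff might be applied to a pair of candidate variables is given below. The data are the first five observations of stem length and stem diameter from table 3, used only for illustration; the function names are ours, and treating a correlation exactly at the cutoff as a proxy relationship is an assumption.

```python
import math


def pearson_r(x, y):
    """Ordinary product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# r = 0.32 corresponds to r squared of about 0.10 (10 percent).
CUTOFF = round(math.sqrt(0.10), 2)


def is_proxy_pair(x, y, cutoff=CUTOFF):
    """Flag a pair whose absolute correlation reaches the cutoff."""
    return abs(pearson_r(x, y)) >= cutoff


# First five observations from table 3 (illustration only).
stem_length = [11.64, 11.39, 9.56, 11.93, 9.28]
stem_diameter = [7.3, 6.4, 5.1, 7.4, 3.5]
print(CUTOFF, is_proxy_pair(stem_length, stem_diameter))
```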


APPENDIX  B:  THE  ITERATIVE  VARIANCE  INFLATION  FACTOR  METHOD  (IVIF) 


The IVIF method requires that the variance inflation
factors (VIF's) be examined after each new explanatory
variable is entered in the model. If the addition
of a new variable causes a dramatic increase (to a value
greater than 1.5) in the VIF's of some other variable
or variables already in the model, the new variable
and the variables with the increased VIF's are collinear
and thus belong to the same proxy set.
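
To give the 1.5 cutoff a more familiar scale, the sketch below converts between a VIF and the R² obtained when that variable is regressed on the other explanatory variables in the model. It assumes the usual definition, VIF = 1 / (1 - R²); the function names are ours. The values 2.5 and 27.4 are VIF's that appear in table 4.

```python
def vif_from_r2(r2):
    """VIF implied by the R-squared of one explanatory variable
    regressed on the others (usual definition, assumed here)."""
    return 1.0 / (1.0 - r2)


def r2_from_vif(vif):
    """Inverse of vif_from_r2."""
    return 1.0 - 1.0 / vif


for vif in (1.5, 2.5, 27.4):
    print(vif, round(r2_from_vif(vif), 3))
```

Under this definition a VIF of 1.5 means that one-third of the variation in a variable is accounted for by the other variables in the model, and a VIF of 1.0 means none of it is.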

If the most recently entered variable causes the
VIF of other variables already in the model to jump,
it should be removed immediately. Then the next
variable under consideration should be entered. This
procedure of entering and possibly removing variables
is followed until the analyst has attempted to enter
each of the candidate variables in the model. After
all variables have been entered into the model one
time and some have been removed, all variables still
included in the model should have VIF's between 1.0
and 1.5. From this point, each entry of one of the
previously removed variables will cause a jump in the
VIF's of some variables already in the model. Those
variables that experience VIF jumps are collinear with
the variable just entered. Proxy sets containing more
than two explanatory variables are easily identified
in this way.

A simple algorithm allows straightforward identification
of proxy sets, even when they contain more
than two explanatory variables. The algorithm follows:

Step 1A: Fit the model with only the first candidate
explanatory variable. This will always produce a single
VIF of 1.0.

Step 1B: Add the next candidate variable unless
there are no others, in which case go to step 3A.

Step 2A: If there was no marked increase in any
VIF, say to a value greater than 1.5, then return to
step 1B.

Step 2B: If there was a jump in any of the previous
VIF's, remove the last variable entered in step 1B. All
variables in the model should then have VIF's near 1.0.
The variable that was just removed belongs to the
same proxy set as the previous variable or variables
whose VIF jumped in size. Two members of a proxy
set have now been identified. Return to step 1B.

Step 3A: At this point, all variables should have been
entered into the model once. All variables still in the
model should have small VIF's, say less than about
1.5. Each of the variables that was entered into the
model and then removed will belong to some proxy
set. You should keep a list of all such sets.

Step 3B: Reenter the first variable that was removed
from the model. This should cause an immediate
increase in the VIF of the variable that has already been
identified as belonging to the same proxy set as the one
just reentered. Note that other variables in the model
may also display a jump in their VIF. They belong to
the same proxy set as the ones identified before. It
is this reentry procedure that allows identification
of proxy sets with more than two members.

Step 3C: Repeat step 3B until all previously removed
variables have been reentered into the model. There
may now be several large VIF's when all variables are
in the model, but you should already have identified
the proxy sets.
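
The steps above can be sketched in code. This is one possible reading of the algorithm, not a definitive implementation, and it rests on several assumptions: VIF's are taken from the diagonal of the inverse correlation matrix rather than from a regression program, a "jump" means a VIF crossing the 1.5 cutoff, reentered variables stay in the model, and the data are synthetic patterns built so that the proxy-set structure is known. The variable names are borrowed from the example that follows, but the numbers are not those of table 3.

```python
def correlation_matrix(columns):
    """Pearson correlation matrix of equal-length columns."""
    unit = []
    for col in columns:
        mean = sum(col) / len(col)
        dev = [v - mean for v in col]
        norm = sum(d * d for d in dev) ** 0.5
        unit.append([d / norm for d in dev])
    return [[sum(a * b for a, b in zip(u, w)) for w in unit] for u in unit]


def inverse(matrix):
    """Gauss-Jordan inverse with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        pivot = aug[c][c]
        aug[c] = [v / pivot for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]


def vifs(data, names):
    """VIF of each variable: diagonal of the inverse correlation matrix."""
    inv = inverse(correlation_matrix([data[k] for k in names]))
    return {k: inv[i][i] for i, k in enumerate(names)}


def ivif_proxy_sets(data, order, cutoff=1.5):
    model, removed, links = [], [], []
    for name in order:                        # steps 1A to 2B
        trial = vifs(data, model + [name])
        jumped = [k for k in model if trial[k] > cutoff]
        if jumped:                            # step 2B: record, then remove
            removed.append(name)
            links.append({name, *jumped})
        else:                                 # step 2A: keep the variable
            model.append(name)
    for name in removed:                      # steps 3A to 3C: reenter
        before = vifs(data, model)
        model.append(name)
        after = vifs(data, model)
        links.append({name} | {k for k in before
                               if before[k] <= cutoff < after[k]})
    sets = []                                 # merge links sharing a member
    for link in links:
        for s in [s for s in sets if s & link]:
            link |= s
            sets.remove(s)
        sets.append(link)
    return sets


def walsh(k, n=16):
    """+1/-1 pattern; patterns with different k in 1..15 are uncorrelated."""
    return [(-1) ** bin(i & k).count("1") for i in range(n)]


w = {k: walsh(k) for k in range(1, 8)}
data = {  # synthetic data with a known three-member and two-member set
    "leaf density": w[1],
    "stem length": w[2],
    "stem diameter": [a + b + 0.1 * c for a, b, c in zip(w[2], w[3], w[4])],
    "growth rate": w[5],
    "leaf length": w[3],
    "time to 2-cm height": [-a + 0.2 * b for a, b in zip(w[5], w[6])],
    "firmness": w[7],
}
proxy_sets = ivif_proxy_sets(data, list(data))
print(sorted(sorted(s) for s in proxy_sets))
```

In this synthetic case, stem diameter is removed on first entry because it raises the VIF of stem length; only on reentry, with leaf length also in the model, does the third member of the set show itself, which is the behavior step 3B describes.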

An  Example 

An example will illustrate use of the IVIF method.
The  data  consist  of  25  observations  on  11  candidate 
explanatory  variables  and  a  response  variable,  total 
biomass.  The  data  are  presented  in  table  3. 

The  only  thing  required  to  use  the  IVIF  method  is 
any  computer  program  that  prints  the  VIFs  after  the 
model  is  fitted.  Table  4  indicates  that  a  total  of  15 
passes  of  the  regression  program  are  required  to  ob- 
tain the  desired  result.  Actually,  fewer  than  15  passes 
are  required  when  one  has  become  familiar  with  the 
process,  but  to  avoid  confusion  for  someone  learning 
the  method  we  recommend  that  all  15  passes  be  used. 
Because  this  process  is  iterative,  a  completely  interac- 
tive computer  program  makes  the  work  much  simpler. 

When we use this method, we add (or remove) an
explanatory variable on each pass. We can introduce
the explanatory variables in whatever order is convenient.
We chose to add the variable, leaf density, first.
Therefore, on pass 1 we fit a regression with only a
single explanatory variable, leaf density. Whatever
variable we choose to fit first, the VIF of that single
variable will always be 1.0. We have just completed
step 1A.

Table 3 - The hypothetical data used to illustrate the iterative variance inflation factor method of identifying proxy sets and groups of collinear
explanatory variables with total biomass as the response variable

        Leaf    Stem      Stem  Growth     Crown  Percent    Leaf      Time to  Firm-     Infrared  Ratio    Total
No.  density  length  diameter    rate  diameter  sucrose  length  2-cm height   ness  reflectance   Ca/N  biomass
  1      6.9   11.64       7.3     3.0       3.2      7.8    5.35         2.70   6.41          4.5    1.5     27.6
  2       .6   11.39       6.4     3.1       4.1      8.3    5.08         2.59   6.10          4.5     .8     23.3
  3      6.8    9.56       5.1     4.7       3.0      8.8    5.28         1.49   6.04          4.1    1.8     24.7
  4      5.1   11.93       7.4     3.0       2.9      7.4    5.33         3.06   6.45          3.6    1.2     26.8
  5      2.5    9.28       3.5     4.6       2.5      9.1    4.76         1.17   5.67          4.1     .9     19.6
  6      7.5    8.64        .5     4.2       3.5      8.6    3.40         1.85   6.18          4.3     .8     25.6
  7      2.6    8.78       3.0     4.0       3.1      5.7    4.74         2.01   5.98          4.0    1.3     19.9
  8      5.6   11.01       6.2     2.9       2.1      8.3    5.15         3.56   5.53          4.1    1.0     25.3
  9      6.1   11.25       6.4     1.1       3.6      7.4    5.14         4.68   5.76          4.1    1.2     30.8
 10       .4   11.06       5.5     1.7       2.8      8.7    4.70         4.10   6.13          3.7    1.7     21.9
 11      7.1    9.26       3.7     3.0       3.2      8.2    4.73         2.75   5.67          4.0    1.5     25.9
 12      5.9    9.53       4.2     2.3       2.4      7.8    4.93         3.70   5.92          4.6    1.1     25.9
 13      3.1    8.21       3.8     3.0       2.7      7.5    5.24         2.91   6.27          4.0     .7     21.1
 14      7.7    9.76       4.9     1.7       3.8      8.3    5.03         4.23   6.01          3.9     .5     28.6
 15      4.1    7.29       1.7     2.6       3.1      6.6    4.71         3.21   5.95          3.8    1.1     21.9
 16      4.8    8.59       4.2     1.9       2.8     10.0    5.13         3.68   6.05          3.6     .8     23.0
 17      5.3    6.24       1.4     1.3       2.6      8.3    5.15         4.83   6.24          4.2    1.1     22.4
 18      5.2   10.01       5.2     3.5       3.1      7.5    5.30         2.29   5.78          3.9     .8     25.1
 19      5.2   11.69       5.2     3.8       2.4      8.3    4.21         2.22   6.71          4.0     .5     25.4
 20  (row garbled in source; only the value 11.10 is legible)
 21      5.6   11.65       7.9     2.1       2.6      7.8    5.58         4.03   6.07          4.2     .4     27.2
 22      4.6   11.57       3.5     2.6       3.2      7.8    3.52         3.83   5.54          3.6     .7     26.9
 23      7.0    9.64       5.9     3.7       2.9      6.7    5.60         2.15   6.05          3.5     .7     24.2
 24      4.1    9.01       7.1     4.4       3.7      8.8    6.49         1.55   5.84          3.9     .6     22.9
 25      5.2   11.89       7.3     3.2       2.5      7.5    5.22         2.66   5.59          4.3     .8     26.9

Suppose  we  choose  to  enter  stem  length  in  pass  2. 
From  table  4,  we  see  that  the  VIF  for  leaf  density  is 
unaffected  by  the  introduction  of  stem  length.  There- 
fore the  two  variables  do  not  belong  to  the  same  proxy 
set.  Note,  however,  that  this  does  not  imply  that  either 
of  these  variables  is  free  of  proxy-set  membership  with 


other  variables.  We  have  now  completed  step  2A, 
and we must return to step 1B to enter the next
variable. 

When stem diameter is entered in pass 3, we note
an immediate jump in the VIF for stem length. This
indicates that stem diameter and stem length are
members of the same proxy set. This is step 2B and
we must remove the most recently added variable,
stem diameter, from the model. Thus, pass 4 produces
the same results as pass 2.

Table 4 - Pass-by-pass results (VIF's) of the iterative variance inflation factor method used to determine proxy sets for variables in table 3
(- = variable not in the model on that pass)

         Leaf    Stem      Stem  Growth     Crown  Percent    Leaf      Time to  Firm-     Infrared  Ratio
Pass  density  length  diameter    rate  diameter  sucrose  length  2-cm height   ness  reflectance   Ca/N
   1      1.0       -         -       -         -        -       -            -      -            -      -
   2      1.0     1.0         -       -         -        -       -            -      -            -      -
   3      1.0     2.5       2.5       -         -        -       -            -      -            -      -
   4      1.0     1.0         -       -         -        -       -            -      -            -      -
   5      1.0     1.0         -     1.0         -        -       -            -      -            -      -
   6      1.0     1.0         -     1.0       1.0        -       -            -      -            -      -
   7      1.0     1.0         -     1.0       1.0      1.0       -            -      -            -      -
   8      1.0     1.0         -     1.0       1.0      1.0     1.0            -      -            -      -
   9      1.1     1.0         -    27.4       1.1      1.0     1.1         27.8      -            -      -
  10      1.0     1.0         -     1.0       1.0      1.0     1.0            -      -            -      -
  11      1.1     1.0         -     1.0       1.0      1.0     1.0            -    1.0            -      -
  12      1.0     1.0         -     1.0       1.0      1.0     1.0            -    1.0          1.0      -
  13      1.0     1.0         -     1.0       1.0      1.0     1.0            -    1.0          1.0    1.0
  14      1.0   158.1     264.1     1.0       1.0      1.2   106.7            -    1.1          1.1    1.1
  15      1.1   161.9     269.5    29.6       1.1      1.2   108.2         29.7    1.2          1.1    1.1

Passes  5  to  8  identify  no  proxy  sets.  Pass  5  intro- 
duces growth  rate  into  the  model.  Because  no  VIF's 
jump, we are at step 2A again and must go to step 1B
to  enter  another  variable.  Pass  6  enters  crown  diam- 
eter into  the  model.  Again,  no  VIF's  jump,  so  we  can 
enter  another  variable.  Pass  7  introduces  percent  su- 
crose, without  causing  any  VIF's  to  jump.  This  means 
percent  sucrose  is  not  a  proxy  of  any  variables  already 
in  the  model.  Likewise,  pass  8  enters  leaf  length  with- 
out a  jump  in  any  VIF. 

Pass  9  introduces  time  to  2-cm  height  (the  time  it 
takes  seedlings  to  grow  to  a  height  of  2  cm)  into  the 
model,  and  we  see  an  immediate  jump  in  the  VIF  of 
growth  rate.  We  have  now  identified  another  proxy 
set  that  contains  at  least  the  two  variables,  growth 
rate  and  time  to  2-cm  height.  We  are  at  step  2B  again 
and  must  remove  the  last  variable  entered,  namely 
time  to  2-cm  height.  The  result  is  pass  10. 

Passes  11  to  13  identify  no  proxy  sets.  Pass  11  adds 
the  variable,  firmness,  to  the  model  with  no  jump  in  the 
VIF  for  any  variable  already  in  the  model.  Neither  in- 
frared reflectance  in  pass  12  nor  the  calcium-to-nitrogen 
ratio  in  pass  13  caused  a  jump  in  any  VIF.  Because  we 
have  now  entered  all  candidate  variables  once,  we  are 
at step 1B, which directs us to go to step 3A.

We have removed two variables from the model: stem
diameter (which we know is a proxy of stem length)
and time to 2-cm height (which we know is a proxy of
growth rate). We are now at step 3B and must reenter
the first variable that was removed from the model,
stem diameter. When we do so (pass 14), the VIF's
of both stem length and leaf length jump. Therefore,
we have identified a proxy set containing three
explanatory variables: stem diameter, stem length, and
leaf length.

We  are  now  at  step  3C,  and  we  reenter  the  remaining 
variable  that  we  removed  earlier,  time  to  2-cm  height 
(pass  15).  Upon  entry  into  the  model,  this  variable 
caused a jump in only a single variable, growth rate.
Therefore,  this  is  a  second  proxy  set  containing  only 
two  variables:  time  to  2-cm  height  and  growth  rate. 

Note  the  important  difference  between  pass  14  and 
pass  15.  In  pass  14,  a  jump  occurred  in  the  VIF's 
of  two  variables,  while  in  pass  15  only  a  single  VIF 
increased. 

This  method,  the  iterative  variance  inflation  factor 
method,  is  dynamic  and  allows  identification  of  proxy 
sets  of  more  than  just  two  members.  If  followed  sys- 
tematically, it  is  an  effective  means  of  identifying 
proxies. 

Here  are  a  few  time-saving  suggestions.  Because 
pass  1  will  always  yield  a  VIF  of  1.0,  it  is  unnecessary. 
We  can  start  with  pass  2,  entering  two  variables  ini- 
tially and  remembering  that  if  the  resulting  VIF's 
are  large  (that  is  greater  than  1.5),  the  two  variables 
should  be  considered  proxies  and  one  of  the  variables 
should  be  removed  from  the  model.  Note  also  that 
pass  4  is  identical  to  pass  2  and  pass  10  is  identical  to 
pass  8.  Therefore,  we  could  have  gone  directly  from 
pass  3  to  pass  5  by  removing  stem  diameter  and  add- 
ing growth  rate  in  a  single  pass.  Likewise,  we  could 
have  gone  directly  from  pass  9  to  pass  11.  These  sug- 
gestions could  have  reduced  the  number  of  passes 
needed  from  15  down  to  12. 
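
The pass counts quoted above follow from simple bookkeeping, sketched below. The function name is ours, and the shortcut count assumes that every removal can be combined with the entry of a further candidate, as it could in this example.

```python
def passes_required(n_candidates, n_removed, shortcuts=False):
    """Number of regression passes for the IVIF method.

    Full procedure: one pass per first entry, one per removal, and
    one per reentry. Shortcuts: skip pass 1 and combine each removal
    with the entry of the next candidate variable.
    """
    full = n_candidates + 2 * n_removed
    if not shortcuts:
        return full
    return full - 1 - n_removed


# The example: 11 candidate variables, 2 of which were removed.
print(passes_required(11, 2), passes_required(11, 2, shortcuts=True))
```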
Introduced here is the concept of a proxy set, defined to be a collection of potential explanatory
variables linearly related to one another. Therefore, each member of the proxy
set  conveys  some,  and  perhaps  much,  of  the  same  information  as  other  members  of  the 
same  proxy  set  if  they  are  included  in  a  multiple  linear  regression  together.  Interpreting  a 
coefficient  in  a  multiple  linear  regression  equation  can  be  misleading  if  proxy-set  member- 
ship is  ignored,  even  if  the  final  regression  model  does  not  include  some  of  the  variables  in 
the  proxy  set.  This  study  compares  the  effectiveness  of  seven  different  methods  of  identi- 
fying proxy  sets. 